t-SNE overview
Basic information
t-SNE: t-distributed stochastic neighbor embedding
A method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map.
Computational complexity of a direct implementation: $O(N^2)$ in the number of datapoints.
Original paper: "Visualizing Data using t-SNE", Laurens van der Maaten, Geoffrey Hinton, 2008.
t-SNE builds on Stochastic Neighbor Embedding (SNE), originally developed by Sam Roweis and Geoffrey Hinton; Laurens van der Maaten proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique well suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.
| Advantages | Disadvantages |
|---|---|
| Handles non-linear data efficiently | Computationally complex |
| Preserves local structure well | Non-deterministic |
| | Requires hyperparameter tuning |
Algorithm
Example dataset
Let’s assume that we have 5 datapoints (“A”, “B”, “C”, “D”, “E”) embedded in 3-dimensional space. Our objective is to embed these points in 2-dimensional space so that the local structure is preserved.
| | x | y | z |
|---|---|---|---|
| A | 1.00 | 0.90 | 0.10 |
| B | 1.10 | 1.00 | 0.20 |
| C | 1.05 | 1.05 | -0.10 |
| D | 1.50 | 1.02 | 0.10 |
| E | 1.40 | 1.01 | 0.15 |
Step 1 - calculate the distances between points
First, we calculate the Euclidean distances $d_{ij} = \lVert x_i - x_j \rVert$ between the points in the high-dimensional space:
| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0.000 | 0.173 | 0.255 | 0.514 | 0.418 |
| B | 0.173 | 0.000 | 0.308 | 0.413 | 0.304 |
| C | 0.255 | 0.308 | 0.000 | 0.493 | 0.432 |
| D | 0.514 | 0.413 | 0.493 | 0.000 | 0.112 |
| E | 0.418 | 0.304 | 0.432 | 0.112 | 0.000 |
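The distance table can be reproduced with a few lines of Python (a minimal sketch using only the standard library):

```python
import math

# coordinates of the 5 example datapoints in 3-D space
points = {
    "A": (1.00, 0.90, 0.10),
    "B": (1.10, 1.00, 0.20),
    "C": (1.05, 1.05, -0.10),
    "D": (1.50, 1.02, 0.10),
    "E": (1.40, 1.01, 0.15),
}

def euclidean(p, q):
    # straight-line distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(round(euclidean(points["A"], points["B"]), 3))  # 0.173
```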
Step 2 - calculate the probability distribution in high-dim space
For each point in our dataset we calculate the similarity to the other points, described in the original paper as follows: the similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability $p_{j|i}$ that $x_i$ would pick $x_j$ as its neighbor. Note that this definition is not symmetric, i.e. $p_{j|i} \neq p_{i|j}$.
Let’s pick one point, e.g. “A” and calculate the aforementioned conditional probabilities with respect to point “A”:
| pair | distance |
|---|---|
| Dist A-A | 0.000 |
| Dist A-B | 0.173 |
| Dist A-C | 0.255 |
| Dist A-D | 0.514 |
| Dist A-E | 0.418 |
We map the distances through a Gaussian with mean 0 and standard deviation $\sigma_i$. The probability that point “A” will pick point “B” as its neighbor is proportional to the Gaussian density evaluated at “Dist A-B”. Additionally, we set $p_{i|i} = 0$ by definition. After we calculate the densities for all points, we normalize the values so that they form a probability distribution. Overall, the probability that point $i$ will pick point $j$ is given by the formula:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$
The remaining issue is how to determine the values $\sigma_i$ of the Gaussians: they are chosen so that the conditional distribution $P_i$ of every point has a fixed, user-specified perplexity (cf. Appendix 1).
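The standard way to determine $\sigma_i$ is a binary search for the value that makes the conditional distribution hit the target perplexity, since perplexity grows monotonically with $\sigma$. A minimal sketch, using the distances from point “A” above and a hypothetical target perplexity of 3:

```python
import math

def perplexity(dists, sigma):
    # perplexity 2^H of the conditional distribution induced by a
    # Gaussian of width sigma over the given distances
    w = [math.exp(-d * d / (2 * sigma * sigma)) for d in dists]
    total = sum(w)
    p = [x / total for x in w]
    entropy = -sum(x * math.log2(x) for x in p if x > 0)
    return 2 ** entropy

def find_sigma(dists, target, lo=1e-3, hi=10.0, iters=50):
    # perplexity is monotonically increasing in sigma, so bisect
    for _ in range(iters):
        mid = (lo + hi) / 2
        if perplexity(dists, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# distances from point "A"; target perplexity 3 is an illustrative choice
dists_a = [0.173, 0.255, 0.514, 0.418]
sigma_a = find_sigma(dists_a, 3.0)
```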
| | Point A | Point B | Point C | Point D | Point E |
|---|---|---|---|---|---|
| distance (Euclidean) | 0 | 0.173 | 0.255 | 0.514 | 0.418 |
| probability | 0 | 0.752 | 0.701 | 0.470 | 0.563 |
| norm. probability | 0 | 0.302 | 0.282 | 0.189 | 0.226 |
Now we have the similarity of datapoint “A” to the other datapoints, defined as the conditional probability $p_{j|A}$ (“A” would pick $j$ as its neighbor):
Let’s calculate all 5 distributions (one distribution per datapoint):
| | Point A | Point B | Point C | Point D | Point E |
|---|---|---|---|---|---|
| P_A | 0.000 | 0.302 | 0.282 | 0.189 | 0.226 |
| P_B | 0.285 | 0.000 | 0.250 | 0.215 | 0.251 |
| P_C | 0.292 | 0.275 | 0.000 | 0.204 | 0.229 |
| P_D | 0.204 | 0.246 | 0.213 | 0.000 | 0.337 |
| P_E | 0.221 | 0.260 | 0.215 | 0.305 | 0.000 |
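The full matrix of conditional probabilities can be sketched in pure Python. A single bandwidth $\sigma = 0.5$ for every point reproduces the table above up to rounding in the last digit; that shared value is an assumption made here to match the numbers, not something fixed by the algorithm:

```python
import math

# coordinates of the 5 example datapoints in 3-D space
points = {
    "A": (1.00, 0.90, 0.10),
    "B": (1.10, 1.00, 0.20),
    "C": (1.05, 1.05, -0.10),
    "D": (1.50, 1.02, 0.10),
    "E": (1.40, 1.01, 0.15),
}

def conditional_p(points, sigma=0.5):
    # Gaussian kernel densities, row-normalized into probabilities;
    # the 1/(sigma*sqrt(2*pi)) constant cancels in the normalization
    names = list(points)
    P = {}
    for i in names:
        dens = {}
        for j in names:
            if i == j:
                continue  # p_{i|i} = 0 by definition
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            dens[j] = math.exp(-sq / (2 * sigma ** 2))
        total = sum(dens.values())
        P[i] = {j: d / total for j, d in dens.items()}
    return P

P = conditional_p(points)
```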
Step 3 - calculate probability distribution in low-dim space
Let’s do exactly the same thing in low-dimensional space by performing the following steps:
- we randomly select the coordinates of all datapoints in low-dimensional space
- instead of the normal distribution we apply the Student’s t-distribution with one degree of freedom (equivalently, a Cauchy distribution), whose heavier tails counteract the crowding of points in the low-dimensional map
| | Point A | Point B | Point C | Point D | Point E |
|---|---|---|---|---|---|
| distance (Euclidean) | 0 | 0.130 | 0.366 | 0.248 | 0.227 |
| probability | 0 | 0.771 | 0.610 | 0.706 | 0.720 |
| norm. probability | 0 | 0.275 | 0.217 | 0.252 | 0.257 |
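The low-dimensional similarities use the t kernel $(1 + d^2)^{-1}$; the $1/\pi$ constant of the t density cancels after normalization. A sketch using the 2-D distances from the table above (the exact numbers depend on the kernel and rounding used, so they may differ slightly from the table's probability rows):

```python
# 2-D distances from point "A", taken from the table above
dists = {"B": 0.130, "C": 0.366, "D": 0.248, "E": 0.227}

# Student's t kernel with one degree of freedom: (1 + d^2)^-1
kernel = {j: 1.0 / (1.0 + d * d) for j, d in dists.items()}

# normalize into a probability distribution over the neighbors of "A"
total = sum(kernel.values())
q = {j: k / total for j, k in kernel.items()}
```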
Overall, the probability that point $i$ will pick point $j$ is given by the formula:

$$q_{j|i} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \lVert y_i - y_k \rVert^2\right)^{-1}}$$

where $y_i$ denotes the coordinates of point $i$ in the low-dimensional space.
Step 4 - train like a supervised model
Now, for each point we have two distributions: in low-dim space (Q) and high-dim space (P).
| | Point A | Point B | Point C | Point D | Point E |
|---|---|---|---|---|---|
| P_A | 0.000 | 0.302 | 0.282 | 0.189 | 0.226 |
| Q_A | 0.000 | 0.275 | 0.217 | 0.252 | 0.257 |
| P_B | 0.285 | 0.000 | 0.250 | 0.215 | 0.251 |
| Q_B | 0.254 | 0.000 | 0.232 | 0.256 | 0.258 |
| P_C | 0.292 | 0.275 | 0.000 | 0.204 | 0.229 |
| Q_C | 0.216 | 0.249 | 0.000 | 0.268 | 0.267 |
| P_D | 0.204 | 0.246 | 0.213 | 0.000 | 0.337 |
| Q_D | 0.232 | 0.256 | 0.250 | 0.000 | 0.262 |
| P_E | 0.221 | 0.260 | 0.215 | 0.305 | 0.000 |
| Q_E | 0.236 | 0.256 | 0.247 | 0.261 | 0.000 |
Now, the idea is to change the coordinates in low-dim space so that the Q-distributions become similar to the P-distributions; this way the local structure of the high-dim space is reflected in the low-dim space. Effectively, we tackle the problem of optimizing parameters (just like in a supervised model):
- Parameters to be trained: coordinates of points in low-dimensional space
- Actual values: P-distributions
- Optimizer: stochastic gradient descent
- Loss function: Kullback–Leibler divergence:

$$KL(P_i \parallel Q_i) = \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

or, using the previous notation (for point A):

$$KL(P_A \parallel Q_A) = \sum_{j \in \{B,C,D,E\}} p_{j|A} \log \frac{p_{j|A}}{q_{j|A}}$$

For all points in the dataset:

$$C = \sum_i KL(P_i \parallel Q_i) = \sum_i \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

In practice, we can also define symmetrized joint probabilities and change the loss function to the following form:

$$C = KL(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$ and $q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$.
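Plugging the P and Q rows for point “A” from the tables above into the KL formula gives a quick sanity check (pure-Python sketch):

```python
import math

# P_A and Q_A rows for point "A", taken from the tables above
p = [0.302, 0.282, 0.189, 0.226]
q = [0.275, 0.217, 0.252, 0.257]

# KL(P_A || Q_A): zero when the distributions match, positive otherwise
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Gradient descent on the map coordinates drives this quantity toward zero.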
Final output
For technical reasons, the original dataset was extended by adding some noise.
The results in 2-dim space are the following:
Examples
Iris
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
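In practice one rarely implements t-SNE by hand. A minimal sketch using scikit-learn (assuming it is installed) to embed the Iris dataset in 2-D:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# 150 samples, 4 features (sepal/petal length and width)
X, y = load_iris(return_X_y=True)

# perplexity must be smaller than the number of samples;
# random_state pins down the otherwise non-deterministic result
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # (150, 2)
```

Coloring the resulting 2-D points by `y` typically shows setosa well separated from the other two species.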